Abstract

This paper introduces a new type of unsuper- 
vised learning algorithm, based on the align- 
ment of sentences and Harris's (1951) notion 
of interchangeability. The algorithm is ap- 
plied to an untagged, unstructured corpus 
of natural language sentences, resulting in 
a labelled, bracketed version of the corpus. 
Firstly, the algorithm aligns all sentences in 
the corpus in pairs, resulting in a partition of 
the sentences consisting of parts of the sen- 
tences that are similar in both sentences and 
parts that are dissimilar. This information 
is used to find (possibly overlapping) con- 
stituents. Next, the algorithm selects (non- 
overlapping) constituents. Several instances 
of the algorithm are applied to the ATIS cor- 
pus (Marcus et al., 1993) and the OVIS 1 cor- 
pus (Bonnema et al., 1997). Apart from the 
promising numerical results, the most strik- 
ing result is that even the simplest algorithm 
based on alignment learns recursion. 

1. Introduction 

This paper introduces a new type of grammar learn- 
ing algorithm, which uses the alignment of sentences to 
find possible constituents in the form of labelled brack- 
ets. When all possible constituents are found, the al- 
gorithm selects the best constituents. We call this type 
of algorithm Alignment-Based Learning (ABL). 

The main goal of the algorithm is to automatically 
find constituents in plain sentences in an unsupervised 
way. The only information the algorithm uses stems 
from these sentences; no additional information (for 
example POS-tags) is used. 

The underlying idea behind our algorithm is Harris's
notion of interchangeability: two constituents of the
same type can be replaced by each other. ABL finds
constituents by looking for parts of sentences that can
be replaced by each other and assumes that these parts
are probably constituents, which is Harris's notion
reversed.

1 Openbaar Vervoer Informatie Systeem (OVIS) stands
for Public Transport Information System.

At some point the algorithm may have learned possible 
constituents that overlap. Since generating results is 
done by comparing a learned structure to the structure 
in the corpus, the algorithm needs to disambiguate 
conflicting constituents. This process continues until one
tree structure covering the sentence remains.

This paper is organised as follows. We start out by 
describing the algorithm in detail. We then report 
experimental results from various instances of the al- 
gorithm. We discuss the algorithm in relation to other 
grammar learning algorithms, followed by a description
of some future research.

2. Algorithm 

In this section we describe an algorithm that learns 
structure in the form of labelled brackets on a corpus of 
natural language sentences. This corpus is a selection 
of plain sentences containing no brackets or labels. 

The algorithm was developed on several small corpora. 
These corpora indicated some problems when simply 
applying Harris's idea to learn structure. These prob- 
lems were solved by introducing two phases: alignment 
learning and selection learning, which will now be de- 
scribed in more detail. 

2.1 Alignment Learning 

The first phase of the algorithm is called alignment 
learning. It finds possible constituents by aligning all 
plain sentences from memory in pairs. Aligning un- 
covers parts of the sentences that are similar in both 
sentences and parts that are dissimilar. Finally, the 
dissimilar parts are stored as possible constituents of 
the same type. This is shown by grouping the parts 
and labelling them with a non-terminal. 

Finding constituents like this is based on Harris's
notion of interchangeability. Harris (1951) states that
two constituents of the same type can be replaced by
each other. The alignment learning algorithm tries to
find parts of sentences that can be replaced by each
other, indicating that these parts might be constituents.

We have included a simple example taken from the 
ATIS corpus to give a visualisation of the algorithm 
in Table 1. It shows that Show me is similar in
both sentences and flights from Atlanta to Boston and 
the rates for flight 1943 are dissimilar. The dissimi- 
lar parts are then taken as possible constituents of the 
same type. In this example there are only two dis- 
similar parts, but if there were more dissimilar parts, 
they would also be grouped. However, a different non- 
terminal would be assigned to them (as can be seen in 
sentences 3 and 4 in Table 2). 

Table 1. Bootstrapping structure 

Show me flights from Atlanta to Boston 
Show me the rates for flight 1943 
Show me ( flights from Atlanta to Boston )x 
Show me ( the rates for flight 1943 )x 



Note that if the algorithm tries to align two completely 
dissimilar sentences, no similar parts can be found at 
all. This means that no inner structure can be learned. 
The only constituents that can be learned are those on 
sentence level, since the entire sentences can be seen 
as dissimilar parts. 

2.1.1 Aligning 

The alignment of two sentences can be accomplished 
in several ways. Three different algorithms have been 
implemented, which will be discussed in more detail 
here. 

Firstly, we implemented the edit distance algorithm
by Wagner and Fischer (1974) to find the similar word
groups in the sentences. It finds the minimum edit cost
to change one sentence into the other based on a
predefined cost function γ. The possible edit operations
are insertion, deletion and substitution, which are used
to change one sentence into the other. From the
minimum-cost edit sequence it is possible to find the
words in the sentences that match (i.e. require no edit
operation). These words combined are the similar
parts of the two sentences.

The cost function of the edit distance algorithm can
be defined to find the longest common subsequences
in two sentences. The cost function γ returns 1 for an
insert or delete operation, 0 if the two arguments are
the same and 2 if the two arguments are different. We
will call this algorithm the default γ.
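As an illustration of the default γ, the following sketch runs a Wagner-Fischer style dynamic program over two tokenised sentences and reads off the words that need no edit operation. The function names (`default_gamma`, `align`), the use of `None` for "no word", and the order in which ties are resolved during the backtrace are assumptions made for this example, not a description of the original implementation.

```python
def default_gamma(a, b):
    """Cost of one edit operation; None stands for 'no word' (hypothetical convention)."""
    if a is None or b is None:
        return 1                    # insertion or deletion
    return 0 if a == b else 2       # match or substitution


def align(s1, s2, gamma=default_gamma):
    """Return 0-based index pairs (i, j) of words left unedited by a minimum-cost edit script."""
    n, m = len(s1), len(s2)
    # cost[i][j] = minimum cost of changing s1[:i] into s2[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + gamma(s1[i - 1], None)
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + gamma(None, s2[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j] + gamma(s1[i - 1], None),             # delete
                cost[i][j - 1] + gamma(None, s2[j - 1]),             # insert
                cost[i - 1][j - 1] + gamma(s1[i - 1], s2[j - 1]),    # match / substitute
            )
    # Backtrace along one optimal path, collecting the matched word pairs.
    links, i, j = [], n, m
    while i > 0 and j > 0:
        diagonal = cost[i - 1][j - 1] + gamma(s1[i - 1], s2[j - 1])
        if s1[i - 1] == s2[j - 1] and cost[i][j] == diagonal:
            links.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + gamma(s1[i - 1], None):
            i -= 1
        elif cost[i][j] == cost[i][j - 1] + gamma(None, s2[j - 1]):
            j -= 1
        else:
            i, j = i - 1, j - 1     # substitution
    return links[::-1]


table1 = align("Show me flights from Atlanta to Boston".split(),
               "Show me the rates for flight 1943".split())
table2 = align("from San Francisco to Dallas".split(),
               "from Dallas to San Francisco".split())
```

On the sentences of Table 1 this links only Show and me; on sentences 1 and 2 of Table 2 it links from, San and Francisco, which is the longest common subsequence.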



Unfortunately, this approach has the disadvantage
depicted in Table 2. Here, the algorithm aligns
sentences 1 and 2. Default γ finds that San Francisco
is the longest common subsequence. This is correct,
but results in an unwanted syntactic structure as can
be seen in sentences 3 and 4.

Table 2. Ambiguous alignments

1 from San Francisco to Dallas
2 from Dallas to San Francisco
3 from ( )x1 San Francisco ( to Dallas )x2
4 from ( Dallas to )x1 San Francisco ( )x2
5 from ( San Francisco to )x3 Dallas ( )x4
6 from ( )x3 Dallas ( to San Francisco )x4
7 from ( San Francisco )x5 to ( Dallas )x6
8 from ( Dallas )x5 to ( San Francisco )x6

The problem is that aligning San Francisco results 
in constituents that differ greatly in length. In other 
words, the position of San Francisco in both sentences 
differs significantly. Similarly, aligning Dallas results 
in unintended constituents (see sentences 5 and 6 in 
Table 2), but aligning to would not (as can be seen 
in sentences 7 and 8), since to resides more "in the 
middle" of both sentences. 

This problem is solved by redefining the cost function
of the edit distance algorithm to prefer matches
between words that have similar offsets in the sentences.
When two words have similar offsets, the cost will be
low, but when the words are far apart, the cost will
be higher. We will call this algorithm biased γ. The
biased γ is similar to the default γ, except that in case
of a match, the biased γ returns

\[ \left| \frac{i_1}{s_1} - \frac{i_2}{s_2} \right| \cdot \frac{s_1 + s_2}{2} \]

where $i_1$ and $i_2$ are the indices of the considered words
in sentence 1 and sentence 2, while $s_1$ and $s_2$ are the
lengths of sentence 1 and sentence 2 respectively.
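To make the effect of the biased cost concrete, the sketch below scores the three candidate alignments of Table 2. It assumes 1-based word indices, a cost of 1 for every unmatched word (as in the default cost function), and the helper names `biased_match_cost` and `alignment_cost`; all of these are choices made for illustration only.

```python
def biased_match_cost(i1, i2, s1, s2):
    """Cost of matching word i1 of sentence 1 with word i2 of sentence 2.

    Algebraically equal to |i1/s1 - i2/s2| * (s1 + s2) / 2, rearranged so
    that only one division is needed. Indices are assumed to be 1-based.
    """
    return abs(i1 * s2 - i2 * s1) * (s1 + s2) / (2 * s1 * s2)


def alignment_cost(links, s1, s2):
    """Total edit cost: biased cost per matched pair, 1 per unmatched word."""
    matched = sum(biased_match_cost(i1, i2, s1, s2) for i1, i2 in links)
    unmatched = (s1 - len(links)) + (s2 - len(links))
    return matched + unmatched


# "from San Francisco to Dallas" versus "from Dallas to San Francisco"
candidates = {
    "San Francisco": [(1, 1), (2, 4), (3, 5)],
    "Dallas": [(1, 1), (5, 2)],
    "to": [(1, 1), (4, 3)],
}
costs = {name: alignment_cost(links, 5, 5) for name, links in candidates.items()}
cheapest = min(costs, key=costs.get)
```

Under these assumptions the alignment on to is the cheapest (7.0, against 8.0 for San Francisco and 9.0 for Dallas), which agrees with the preference for words "in the middle" of both sentences.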

Although biased γ solves the problem in Table 2, one
may question whether this solution is always valid. It
may be the case that sometimes a "long distance"
alignment is preferable. Therefore, we implemented a
third algorithm, which does not use the edit distance
algorithm. It finds all possible alignments. In the
example of Table 2 it finds all three mutually exclusive
alignments.

2.1.2 Grouping 

The previous section described algorithms that align
two sentences and find parts of the sentences that are
similar. The dissimilar parts of the sentences, i.e.
the rest of the sentences, are considered possible
constituents. Every pair of new possible constituents
introduces a new non-terminal. 2
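The grouping step can be sketched as follows: given the matched word pairs of an alignment, the gaps between consecutive matches are paired up and each pair receives a fresh non-terminal. The names `dissimilar_spans` and `label_hypotheses`, the 0-based half-open spans, and the numbering of non-terminals from 1 are assumptions of this sketch.

```python
from itertools import count


def dissimilar_spans(len1, len2, links):
    """Pair up the unmatched stretches between consecutive matched words.

    links: sorted 0-based index pairs (i, j) of matched words.
    Returns ((start1, end1), (start2, end2)) pairs with exclusive ends;
    one side of a pair may be empty, as in Table 2.
    """
    spans = []
    prev1, prev2 = -1, -1
    for i, j in list(links) + [(len1, len2)]:   # sentinel closes the last gap
        span1, span2 = (prev1 + 1, i), (prev2 + 1, j)
        if span1[0] < span1[1] or span2[0] < span2[1]:
            spans.append((span1, span2))
        prev1, prev2 = i, j
    return spans


def label_hypotheses(spans, labels=None):
    """Attach a fresh non-terminal (here: the next natural number) to each pair."""
    labels = labels if labels is not None else count(1)
    return [(next(labels), s1, s2) for s1, s2 in spans]


# Table 1: only "Show me" matches, so the rest of each sentence is grouped.
table1 = label_hypotheses(dissimilar_spans(7, 7, [(0, 0), (1, 1)]))
# Table 2, sentences 3 and 4: "from", "San" and "Francisco" match.
table2 = label_hypotheses(dissimilar_spans(5, 5, [(0, 0), (1, 3), (2, 4)]))
```

For Table 2 this yields one hypothesis pairing an empty span with Dallas to and a second pairing to Dallas with an empty span, mirroring sentences 3 and 4 of that table.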

Table 3. Learning with a partially structured sentence and
an unstructured sentence

1 What does ( AP57 restriction )x1 mean
2 What does aircraft code D8S mean
3 What does ( AP57 restriction )x1 mean
4 What does ( aircraft code D8S )x1 mean

At some point the system may find a constituent that 
was already present in one of the two sentences. This 
may occur when a new sentence is compared to a
partially structured sentence in memory. No new type
is introduced; instead the type of the new constituent
will be the same as the type of the constituent in memory.
(See Table 3 for an example.) 

Table 4. Learning with two partially structured sentences

1 Explain the ( meal code )x1
2 Explain the ( restriction AP )x2
3 Explain the ( meal code )x3
4 Explain the ( restriction AP )x3

A more complex case may occur when two partially 
structured sentences are aligned. This happens when a 
new sentence that contains some structure, which was 
learned in a previous step, is compared to a sentence 
in memory. When the alignment of these two sen- 
tences yields a constituent that was already present in 
both sentences, the types of these constituents are then 
merged. All constituents of these types in memory are 
updated so they have the same type. This reduces the 
number of non-terminals in memory as can be seen in 
Table 4. 

2.2 Selection Learning 

The algorithm so far may generate constituents that 
overlap with other constituents. In Table 5 sentence 2 
receives one structure when aligned with sentence 1 
and a different structure when sentence 3 (which is the 
same as sentence 2) is aligned with sentence 4. The 
constituents in sentence 2 and 3 are overlapping. 

This is solved by adding a selection method that
selects constituents until no overlaps remain. (During
the alignment learning phase all possible constituents
are remembered, even if they overlap.) We have
implemented three different methods, although other
implementations may be considered. Note that only one
of the methods is used at a time.

2 In our implementation we used natural numbers to
denote the different types.

Table 5. Overlapping constituents

1 ( Book Delta 128 )x from Dallas to Boston
2 ( Give me all flights )x from Dallas to Boston
3 Give me ( all flights from Dallas to Boston )y
4 Give me ( help on classes )y

2.2.1 Incremental Method of Constituent 
Selection 

The first selection method is based on the assumption 
that once a constituent is learned and remembered, it 
is correct. When the algorithm finds a possible con- 
stituent that overlaps with an older constituent, the 
new constituent is considered incorrect. We call this 
method incr (after incremental). 

The main disadvantage of this method is that once an 
incorrect constituent has been learned, it will never be 
corrected. The incorrect constituent always remains 
in memory. 
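A minimal sketch of this first-come-first-kept behaviour is given below. Constituents are represented as half-open word spans, and "overlap" is taken to mean that two spans cross without one containing the other; both the representation and the names `overlaps` and `select_incremental` are assumptions made for this example.

```python
def overlaps(a, b):
    """True if spans a and b (start, end; end exclusive) cross each other,
    i.e. they share words but neither contains the other."""
    (s1, e1), (s2, e2) = a, b
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1


def select_incremental(hypotheses):
    """Keep hypotheses in the order learned; drop any that crosses an older one."""
    kept = []
    for span in hypotheses:
        if not any(overlaps(span, old) for old in kept):
            kept.append(span)
    return kept


# Table 5, "Give me all flights from Dallas to Boston" (8 words):
x = (0, 4)   # ( Give me all flights )x
y = (2, 8)   # ( all flights from Dallas to Boston )y
learned_x_first = select_incremental([x, y])
learned_y_first = select_incremental([y, x])
```

Whichever of the two constituents is learned first survives, which illustrates why the results of this method depend on the order of the sentences.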

2.2.2 Probabilistic Methods of Constituent 
Selection 

To solve the disadvantage of the incr method, two ad- 
ditional (probabilistic) constituent selection methods 
have been implemented. 

The second selection method computes the probability
of a constituent by counting the number of times the
words in the constituent have occurred as a constituent
in the learned text, normalized by the total number of
constituents:

\[ P_{leaf}(c) = \frac{\left| \{ c' \in C : yield(c') = yield(c) \} \right|}{|C|} \]

where C is the entire set of constituents. This method 
is called leaf since we count the number of times the 
leaves (i.e. the words) of the constituent co-occur in 
the corpus as a constituent. 
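The leaf probability can be sketched as a relative frequency over the stored constituents. The representation of a constituent as a `(label, words)` pair, the function name `leaf_probabilities`, and the choice to count every stored occurrence separately are assumptions of this example; the sample data are invented for illustration.

```python
from collections import Counter


def leaf_probabilities(constituents):
    """Map each yield (tuple of words) to its relative frequency among all constituents.

    constituents: list of (label, words) pairs, one entry per stored occurrence.
    """
    counts = Counter(words for _, words in constituents)
    total = len(constituents)
    return {words: n / total for words, n in counts.items()}


# Invented example memory with four stored constituents.
memory = [
    (1, ("meal", "code")),
    (2, ("meal", "code")),
    (1, ("restriction", "AP")),
    (3, ("all", "flights")),
]
p_leaf = leaf_probabilities(memory)
```

Here the yield meal code receives probability 2/4 regardless of the non-terminals attached to its two occurrences.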

The third method computes the probability of a con- 
stituent using the occurrences of the words in the con- 
stituent and its non-terminal (i.e. it is a normalised 
probability of leaf). 

\[ P_{branch}(c \mid root(c) = r) = \frac{\left| \{ c' \in C : yield(c') = yield(c) \wedge root(c') = r \} \right|}{\left| \{ c'' \in C : root(c'') = r \} \right|} \]

The probability is based on the root node and the
terminals of the constituent, which can be seen as a
branch (of depth one) in the entire structure of the
sentence, hence the name branch.
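The branch probability differs from the leaf probability only in that counts are taken per non-terminal. The sketch below assumes constituents stored as `(label, words)` pairs and a hypothetical function name `branch_probabilities`; the sample data are invented for illustration.

```python
from collections import Counter


def branch_probabilities(constituents):
    """Map each (label, words) pair to its relative frequency among
    the constituents that carry the same label (root)."""
    pair_counts = Counter(constituents)
    root_counts = Counter(label for label, _ in constituents)
    return {
        (label, words): n / root_counts[label]
        for (label, words), n in pair_counts.items()
    }


# Invented example memory with four stored constituents.
memory = [
    (1, ("meal", "code")),
    (2, ("meal", "code")),
    (1, ("restriction", "AP")),
    (3, ("all", "flights")),
]
p_branch = branch_probabilities(memory)
```

In this example meal code has probability 1/2 under non-terminal 1, which it shares with restriction AP, and probability 1 under non-terminal 2, where it is the only yield.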

These two methods are probabilistic in nature. The 
system computes the probability of the constituent us- 
ing the formula and then selects constituents with the 
highest probability. These methods are applied
after alignment learning, since more specific information
(in the form of better counts) can be found at that time.

2.2.3 Combination Probability 

Two methods to determine the probability of a con- 
stituent have been described. Since more than 
two constituents can overlap, a combination of non- 
overlapping constituents has to be selected. There- 
fore, we need to know the probability of a combination 
of constituents. The probability of a combination of 
constituents is the product of the probabilities of the 
constituents as in SCFGs (cf. Booth, 1969). 

Using the product of the probabilities of constituents 
results in a trashing effect, since the product of proba- 
bilities is always smaller than or equal to the separate 
probabilities. Instead, we use a normalised version, 
the geometric mean 3 (Caraballo & Charniak, 1998). 

However, the geometric mean does not have a pref- 
erence for richer structures. When there are two (or 
more) constituents that have the same probability, the 
constituents have the same probability as their combi- 
nation and the algorithm selects one at random. 

To let the system prefer more complex structure when 
there are more possibilities with the same probability, 
we implemented the extended geometric mean. The 
only difference with the (standard) geometric mean 
is that when there are more possibilities (single con- 
stituents or combinations of constituents) with the 
same probability, this system selects the one with the 
most constituents. To distinguish between systems 
that use the geometric mean and those that use the 
extended geometric mean, we add a + to the name of 
the methods that use the extended geometric mean. 
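The difference between the product, the geometric mean and the extended geometric mean can be sketched as follows. A combination is represented simply as a list of constituent probabilities; the function names, the tolerance-based tie test and the `rng` parameter are assumptions made for this example.

```python
import math
import random


def geometric_mean(probs):
    """n-th root of the product of n constituent probabilities."""
    return math.prod(probs) ** (1.0 / len(probs))


def select(combinations, extended=False, rng=random):
    """Pick the combination with the highest geometric mean.

    Ties are broken at random; with extended=True, ties are first
    narrowed down to the combinations with the most constituents.
    """
    scores = [geometric_mean(c) for c in combinations]
    best = max(scores)
    tied = [c for c, s in zip(combinations, scores) if math.isclose(s, best)]
    if extended:
        most = max(len(c) for c in tied)
        tied = [c for c in tied if len(c) == most]
    return rng.choice(tied)


single = [0.5]          # one constituent
pair = [0.5, 0.5]       # two non-overlapping constituents
chosen = select([single, pair], extended=True)
```

The plain product would score the pair at 0.25 and so favour the single constituent; the geometric mean scores both at 0.5, and the extended variant then prefers the pair because it contains more constituents.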

Instead of computing the probabilities of all possible
combinations of constituents, we have used a Viterbi
(1967) style optimization to efficiently select
the best combination of constituents.

3. Test Environment 

In this section we will describe the systems we have 
tested and the metrics we used. 



3.1 System Variables 

The ABL algorithm consists of two phases, alignment 
learning and selection learning. For both phases, we 
have discussed several implementations. 

The alignment learning phase builds on the alignment
algorithm. We have implemented three algorithms:
default γ, biased γ and all alignments.

After the alignment learning phase, the selection learn- 
ing phase takes place, which can be accomplished in 
different ways: incr (the first constituent is correct), 
leaf (based on the probability of the words in the con- 
stituent) and branch (based on the probability of the 
words and label of the constituent). 

There are two ways of combining the probabilities of 
constituents in the probabilistic methods: geometric 
mean and extended geometric mean. A + is added to 
the systems using the extended geometric mean. 

The alignment and selection methods can be combined
into several ABL systems. The names of the algorithms
are of the form alignment:selection, where
alignment and selection represent an alignment and
selection method respectively.

3.2 Metrics 

To see how well the different systems perform, we use 
the three following metrics: 



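\[ \mathrm{NCBP} = \frac{\sum_i \left( |O_i| - |Cross(O_i, T_i)| \right)}{\sum_i |O_i|} \]

\[ \mathrm{NCBR} = \frac{\sum_i \left( |T_i| - |Cross(T_i, O_i)| \right)}{\sum_i |T_i|} \]

\[ \mathrm{ZCS} = \frac{\sum_i \left[ Cross(O_i, T_i) = \emptyset \right]}{|TEST|} \]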



3 The geometric mean of a set of constituents $c_1, \ldots, c_n$
is $P(c_1 \wedge \ldots \wedge c_n) = \sqrt[n]{\prod_{i=1}^{n} P(c_i)}$



Cross(U, V) denotes the subset of constituents from
U that cross at least one constituent in V. $O_i$ and
$T_i$ represent the constituents of a tree in the learned
corpus and in TEST, the original corpus, respectively
(Sima'an, 1999).

NCBP stands for Non-Crossing Brackets Precision, 
which denotes the percentage of learned constituents 
that do not overlap with any constituents in the orig- 
inal corpus. NCBR is the Non-Crossing Brackets Re- 
call and shows the percentage of constituents in the 
original corpus that do not overlap with any con- 
stituents in the learned corpus. Finally, ZCS stands 
for 0-Crossing Sentences and represents the percentage 
of sentences that do not have any overlapping con- 
stituents. 
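The three metrics can be computed directly from their definitions, as the sketch below shows. Trees are represented as lists of half-open word spans, and two spans are taken to cross when they share words but neither contains the other; this representation and the names `crosses`, `cross` and `evaluate` are assumptions of the example, and the two small trees are invented test data.

```python
def crosses(a, b):
    """True if spans a and b (start, end; end exclusive) cross each other."""
    (s1, e1), (s2, e2) = a, b
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1


def cross(U, V):
    """Constituents of U that cross at least one constituent of V."""
    return [u for u in U if any(crosses(u, v) for v in V)]


def evaluate(learned, original):
    """Return (NCBP, NCBR, ZCS) as fractions for parallel lists of trees."""
    pairs = list(zip(learned, original))
    ncbp = (sum(len(O) - len(cross(O, T)) for O, T in pairs)
            / sum(len(O) for O in learned))
    ncbr = (sum(len(T) - len(cross(T, O)) for O, T in pairs)
            / sum(len(T) for T in original))
    zcs = sum(1 for O, T in pairs if not cross(O, T)) / len(original)
    return ncbp, ncbr, zcs


# Invented two-sentence example.
learned = [[(0, 4), (0, 8)], [(0, 3)]]
original = [[(2, 8), (0, 8)], [(0, 3), (1, 3)]]
ncbp, ncbr, zcs = evaluate(learned, original)
```

In this example one of the three learned constituents crosses an original one (NCBP = 2/3), one of the four original constituents is crossed (NCBR = 3/4), and one of the two sentences is free of crossings (ZCS = 1/2).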



Table 6. Results of the ATIS corpus and OVIS corpus

                 ATIS corpus                               OVIS corpus
System           NCBP          NCBR          ZCS           NCBP          NCBR          ZCS
DEFAULT:INCR     82.55 (0.80)  82.98 (0.78)  17.15 (1.17)  88.69 (1.11)  83.90 (1.61)  45.13 (4.12)
BIASED:INCR      [illegible]   [illegible]   [illegible]   [illegible]   [illegible]   [illegible]
ALL:INCR         83.55 (0.63)  83.21 (0.64)  17.04 (1.19)  89.24 (1.23)  84.24 (1.82)  46.84 (5.02)
DEFAULT:LEAF     82.20 (0.30)  82.65 (0.29)  21.05 (0.76)  85.70 (0.01)  79.96 (0.02)  30.87 (0.07)
BIASED:LEAF      81.42 (0.30)  82.75 (0.29)  21.60 (0.66)  85.32 (0.02)  79.96 (0.03)  30.87 (0.09)
ALL:LEAF         82.55 (0.31)  82.11 (0.32)  20.63 (0.70)  85.84 (0.02)  79.58 (0.03)  30.74 (0.08)
DEFAULT:LEAF+    82.31 (0.32)  83.10 (0.31)  22.02 (0.76)  85.67 (0.02)  79.95 (0.03)  30.90 (0.08)
BIASED:LEAF+     81.43 (0.32)  83.11 (0.31)  22.44 (0.70)  85.25 (0.02)  79.88 (0.03)  30.89 (0.08)
ALL:LEAF+        82.55 (0.35)  82.42 (0.35)  21.51 (0.69)  85.83 (0.02)  79.56 (0.03)  30.83 (0.08)
DEFAULT:BRANCH   86.04 (0.10)  87.11 (0.09)  29.01 (0.00)  89.39 (0.00)  84.90 (0.00)  42.05 (0.02)
BIASED:BRANCH    85.31 (0.11)  87.14 (0.11)  29.71 (0.00)  89.25 (0.00)  85.04 (0.01)  42.20 (0.01)
ALL:BRANCH       86.47 (0.08)  86.78 (0.08)  29.57 (0.00)  89.63 (0.00)  84.76 (0.00)  41.98 (0.02)
DEFAULT:BRANCH+  86.04 (0.10)  87.10 (0.09)  29.01 (0.00)  89.39 (0.00)  84.90 (0.00)  42.04 (0.02)
BIASED:BRANCH+   85.31 (0.10)  87.13 (0.09)  29.71 (0.00)  89.25 (0.00)  85.04 (0.00)  42.19 (0.01)
ALL:BRANCH+      86.47 (0.07)  86.78 (0.07)  29.57 (0.00)  89.63 (0.00)  84.76 (0.00)  41.98 (0.02)




4. Results 

Several ABL algorithms are tested on the ATIS corpus 
(Marcus et al., 1993) and on the OVIS corpus (Bonnema
et al., 1997). The ATIS corpus from the Penn
Treebank is a structured, English corpus and consists 
of 716 sentences containing 11,777 constituents. The 
OVIS corpus is a structured, Dutch corpus containing 
sentences on travel information. It consists of exactly 
10,000 sentences. From these sentences we have se- 
lected all sentences of length larger than one, which re- 
sults in 6,797 sentences containing 48,562 constituents. 

The sentences of the corpora are stripped of their 
structure and the ABL algorithms are applied to them. 
The resulting structured sentences are then compared 
to the structures in the original corpus. 

The results of applying the different systems to the 
ATIS corpus and the OVIS corpus can be found in Ta- 
ble 6. All systems have been tested ten times, since 
the incr system depends on the order of the sentences 
and the probabilistic systems sometimes select con- 
stituents at random. The results in the table show the 
mean and the standard deviation (in brackets). 

4.1 Evaluation 

Although we argued that the alignment methods
biased γ and all solve problems of the default γ, this
can hardly be seen when looking at the results. The
main tendency is that the all methods generate higher
precision (NCBP), with a maximum of 89.63 % on the
OVIS corpus, but that the biased γ methods result
in higher recall (NCBR), with 87.14 % on the ATIS
corpus, and more 0-crossing sentences, with 29.71 % on
the ATIS corpus (on the OVIS corpus the maximum is
reached with the all method). The default γ method
performs worse overall. These differences, however, are
slight.

The selection learning methods have a larger impact 
on the differences in the generated corpora. The incr 
systems perform quite well considering the fact that 
they cannot recover from incorrect constituents, with 
a precision and recall of roughly 83 %. The order of 
the sentences however is quite important, since the 
standard deviation of the incr systems is quite large 
(especially with the ZCS, reaching 1.19 %). 

We expected the probabilistic methods to perform bet- 
ter, but the leaf systems perform slightly worse. The 
ZCS, however, is somewhat better, resulting in 22.44 % 
for the leaf+ method. Furthermore, the standard deviations
of the leaf systems (and of the branch systems)
are close to 0 %. The statistical methods generate
more precise results.

The branch systems clearly outperform all other systems.
Using more specific statistics generates better
results.

The systems using the extended geometric mean result 
in slightly better results on the leaf system, but when 
larger corpora are used, this difference disappears com- 
pletely. 

Although the results of the ATIS corpus and OVIS 
corpus differ, the conclusions that can be reached are 
similar. 



Table 7. Recursion learned in the ATIS corpus

learned   What is the ( name of the ( airport in Boston )18 )18
original  What is ( the name of ( the airport in Boston )np )np
learned   Explain classes QW and ( QX and ( Y )52 )52
original  Explain classes ( ( QW )np and ( QX )np and ( Y )np )np



4.2 Recursion 

All ABL systems learn recursion on the ATIS and 
OVIS corpora. Two example sentences from the ATIS 
corpus with the original and learned structure can be 
found in Table 7. The sentences in the example are 
stripped of all but the interesting constituents to make 
it easier to see where the recursion occurs. 

The recursion in the first sentence is not entirely the 
same. The ABL algorithm finds constituents of some 
sort of noun phrase, while the constituents in the ATIS 
corpus show recursive noun phrases. Likewise in the 
second sentence, the ABL algorithm finds a recursive 
noun phrase while the structure in the ATIS corpus is 
similar. 

5. Previous Work 

Existing grammar learning methods can be grouped 
(like other learning methods) into supervised and 
unsupervised methods. Unsupervised methods only 
use plain (or pre-tagged) sentences, while supervised 
methods are first initialised with structured sentences. 

In practice, supervised methods generate better re- 
sults, since they can adapt their output to the struc- 
tured examples from the initialisation phase, whereas 
unsupervised methods do not have any idea what the 
output should look like. Although unsupervised methods
perform worse than supervised methods, they are
necessary to support the time-consuming and costly
creation of structured corpora where neither a corpus
nor a grammar yet exists.

There have been several different approaches to learning
syntactic structures. We will give a short overview 
here. 

Memory based learning (MBL) keeps track of the pos- 
sible contexts and assigns word types based on that 
information (Daelemans, 1995). Magerman and Marcus
(1990) describe a method that finds constituent
boundaries using mutual information values of the part 
of speech n-grams within a sentence and Redington 
et al. (1998) present a method that bootstraps syn- 
tactic categories using distributional information. 



Algorithms that use the minimum description length 
(MDL) principle build grammars that describe the in- 
put sentences using the minimal number of bits. This 
idea stems from information theory. Examples of
these systems can be found in Grünwald (1994) and
de Marcken (1996).

The system by Wolff (1982) performs a heuristic search 
while creating and merging symbols directed by an 
evaluation function. Similarly, Cook et al. (1976) 
describe an algorithm that uses a cost function that 
can be used to direct search for a grammar. Stolcke 
and Omohundro (1994) describe a more recent gram- 
mar induction method that merges elements of mod- 
els using a Bayesian framework. Chen (1995) presents 
a Bayesian grammar induction method, which is fol- 
lowed by a post-pass using the inside-outside algorithm 
(Baker, 1979; Lari & Young, 1990), while Pereira and 
Schabes (1992) apply the inside-outside algorithm to 
a partially structured corpus. 

The supervised system described by Brill (1993) takes 
a completely different approach. It tries to find trans- 
formations that improve a naive parse, effectively re- 
ducing errors. 

The two phases of ABL are closely related to some 
previous work. The alignment learning phase is ef- 
fectively a compression technique comparable to MDL 
or Bayesian grammar induction methods. However, 
ABL remembers all possible constituents, effectively 
building a search space. The selection learning phase 
searches this space, directed by a probabilistic evalua- 
tion function. 

It is difficult to compare the results of the ABL sys- 
tem against other systems, since different corpora or 
metrics are used. The system described by Pereira 
and Schabes (1992) comes reasonably close to ours. 
That system learns structure on plain sentences from 
the ATIS corpus resulting in 37.35 % precision, while 
the unsupervised ABL significantly outperforms this 
method, reaching 86.47 % precision. Only their super- 
vised version results in a slightly higher precision of 
90.36 %. 

A system that simply builds right branching structures
results in 82.70 % precision and 92.91 % recall on the
ATIS corpus, where ABL got 86.47 % and 87.14 %.
These good results could be expected, since English
is a right branching language; a left branching system
performed much worse (32.60 % precision and 76.82 %
recall). On a corpus of Japanese (a left branching
language), right branching would not do very well. Since
ABL does not have a preference for direction built in,
we expect ABL to perform similarly on a Japanese
corpus compared to the ATIS corpus.
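For reference, the two baseline bracketings can be generated in a few lines. The sketch assumes fully binary structures over half-open word spans and omits brackets around single words; these conventions, and the function names, are assumptions and may differ from the baseline systems used for the figures above.

```python
def right_branching(n):
    """Spans of a fully right-branching binary tree over n words:
    every constituent ends at the last word."""
    return [(i, n) for i in range(n - 1)]


def left_branching(n):
    """Spans of a fully left-branching binary tree over n words:
    every constituent starts at the first word."""
    return [(0, j) for j in range(n, 1, -1)]


right = right_branching(4)
left = left_branching(4)
```

For a four-word sentence the right-branching baseline brackets words 1-4, 2-4 and 3-4, while the left-branching baseline brackets words 1-4, 1-3 and 1-2.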

6. Discussion and Future Extensions 

We will discuss several problems of ABL and suggest 
possible solutions to these problems. 

6.1 Wrong Syntactic Type 

There are cases in which the implication "if two parts
of sentences can be replaced, they are constituents of
the same type", which we use in this system, does not
hold. Consider the sentences in Table 8. When applying
the ABL learning algorithm to these sentences, it will
determine that morning and nonstop are of the same
type. However, in the ATIS corpus, morning is tagged
as an NN (a noun) and nonstop is a JJ (an adjective).

Table 8. Wrong syntactic type 

Show me the ( morning )x flights 
Show me the ( nonstop )x flights 

The constituent morning can also be used as a noun 
in other contexts, while nonstop never will. This in- 
formation can be found by looking at the distribution 
of the contexts of constituents in the rest of the cor- 
pus. Based on that information a correct non-terminal 
assignment can be made. 

6.2 Weakening Exact Match 

Aligning two dissimilar sentences yields no structure. 
However, if we weaken the exact match between words 
in the alignment phase, it is possible to learn structure 
even with dissimilar sentences. 

Instead of linking exactly matching words, the algo- 
rithm should match words that are equivalent. One 
way of implementing this is by using equivalence 
classes. With equivalence classes, words that are 
closely related are grouped together. (Redington et al. 
(1998) describe an unsupervised way of finding equiv- 
alence classes.) 

Words that are in the same equivalence class are said
to be sufficiently equivalent and may be linked. Now
sentences that do not have words in common, but do
have words from the same equivalence class in common,
can be used to learn structure.

When using equivalence classes, more constituents are 
learned since more terminals in constituents may be 
seen as similar (according to the equivalence classes). 
This results in structures containing more possible 
constituents from which the selection phase may 
choose. 

6.3 Alternative Statistics 

At the moment we have tested two different ways 
of computing the probability of a bracket: leaf and 
branch. Of course, other systems can be implemented. 
One interesting possibility takes a DOP-like approach 
(Bod, 1998), which takes into account the inner struc- 
ture of the constituents. As can be seen in the results, 
the system that uses more specific statistics performs 
better. 

7. Conclusion 

We have introduced a new grammar learning algorithm
based on aligning plain sentences; no pre-labelled,
bracketed or pre-tagged sentences are used. It
aligns sentences to find dissimilarities between
sentences. The alignments are not limited to a fixed
window size; instead, arbitrarily large contexts are
used. The dissimilarities are used to find all possible
constituents, from which the algorithm afterwards
selects the most probable ones.

Three different alignment methods and five different 
selection methods have been implemented. The in- 
stances of the algorithm have been applied to two cor- 
pora of different size, the ATIS corpus (716 sentences) 
and the OVIS corpus (6,797 sentences), generating 
promising numerical results. Since these corpora are 
still relatively small, we plan to apply the algorithm 
to larger corpora. 

The results showed that the different selection meth- 
ods have a larger impact than the different alignment 
methods. The selection method that uses the most 
specific statistics performs best. Furthermore, the sys- 
tem has the ability to learn recursion. 